ABSTRACT: Recently, Marr and Poggio (1979) presented a theory of human stereo vision. An implementation
of that theory is presented, and consists of five steps: (1) The left and right images are each
filtered with masks of four sizes that increase with eccentricity; the shape of these masks is given by
the Laplacian of a gaussian function. (2) Zero-crossings in the filtered images are found along horizontal scan
lines. (3) For each mask size, matching takes place between zero-crossings of the same sign and roughly the
same orientation in the two images, for a range of disparities up to about the width of the mask's central
region. Within this disparity range, Marr and Poggio showed that false targets pose only a simple problem.
(4) The output of the wide masks can control vergence movements, thus causing small masks to come into
correspondence. In this way, the matching process gradually moves from dealing with large disparities at a
low resolution to dealing with small disparities at a high resolution. (5) When a correspondence is achieved,
it is stored in a dynamic buffer, called the 2½-dimensional sketch. To support the sufficiency of the
Marr-Poggio model of human stereo vision, the implementation was tested on a wide range of stereograms from
the human stereopsis literature. The performance of the implementation is illustrated and compared with
human perception. As well, statistical assumptions made by Marr and Poggio are supported by comparison
with statistics found in practice. Finally, the process of implementing the theory has led to the clarification and
refinement of a number of details within the theory; these are discussed in detail.
1. Introduction

If two objects are separated in depth from a viewer, then the relative positions of their images will
differ in the two eyes. This difference in relative positions, the disparity, may be measured and used to
estimate depth. The process of stereo vision, in essence, measures this disparity and uses it to compute depth
information for surfaces in the scene.

The steps involved in measuring disparity are (Marr and Poggio, 1979): (S1) a particular location on a
surface in the scene must be selected from one image; (S2) that same location must be identified in the other
image; and (S3) the disparity between the two corresponding image points must be measured. The difficulty
of the problem lies in steps (S1) and (S2), that is, in matching the images of the same location, the
so-called correspondence problem. For the case of the human stereo system, it can be shown that this matching
takes place very early in the analysis of an image, prior to any recognition of what is being viewed, using
primitive descriptors of the scene. This is illustrated by the example of random dot patterns. Julesz (1960)
demonstrated that two images, consisting of random dots when viewed monocularly, may be fused to form
patterns separated in depth when viewed stereoscopically. Random dot stereograms are particularly interesting
because when one tries to set up a correspondence between two arrays of dots, false targets occur in profusion.
A false target refers to a possible but incorrect match between elements of the two views. In spite of such
false targets, and in the absence of any monocular or high level cues, we are able to determine the correct
correspondence. Thus, the computational problem of human stereopsis reduces to that of obtaining primitive
descriptions of locations to be matched from the images, and of solving the correspondence problem for these
descriptions.

A computational theory of the stereo process for the human visual system was recently proposed by Marr
and Poggio (1979). According to this theory, the human visual processor solves the stereoscopic matching
problem by means of an algorithm that consists of five main steps: (1) The left and right images are each
filtered at different orientations with bar masks of four sizes that increase with eccentricity; these masks have
a cross-section that is approximately the difference of two gaussian functions, with space constants in the ratio
1:1.75. Such masks essentially perform the operation of a second directional derivative after low pass filtering
or smoothing, and can be used to detect changes in intensity at different scales. (2) Zero-crossings in the
filtered images are found by scanning them along lines lying perpendicular to the orientation of the mask.
Since convolving the image with the masks corresponds to performing a second directional derivative, the
zero-crossings of the convolutions correspond to extrema in the first directional derivative of the image and
thus to sharp changes in the original intensity function. (3) For each mask size, matching takes place between
zero-crossing segments of the same sign and roughly the same orientation in the two images, for a range of
disparities up to about the width of the mask's central region. Within this disparity range, Marr and Poggio
showed that false targets pose only a simple problem, because of the roughly bandpass nature of the filters.
(4) The output of the wide masks can control vergence movements, thus causing smaller masks to come into
correspondence. In this way, the matching process gradually moves from dealing with large disparities at low
resolution to dealing with small disparities at high resolution. (5) When a correspondence is achieved, it is
stored in a dynamic buffer, called the 2½-dimensional sketch (Marr and Nishihara, 1978).

An important aspect in the development of any computational theory is the design and implementation
of an explicit algorithm for that theory. There are several benefits from such an implementation. One concerns
the act of implementation itself, which forces one to make all details of the theory explicit. This often uncovers
previously overlooked difficulties, thereby guiding further refinement of the theory.

A second benefit concerns the performance of the implementation. Any proposed model of a system
must be testable. In this case, by testing on pairs of stereo images, one can examine the performance of the
implementation, and hence of the theory itself, provided, of course, that the implementation is an accurate
representation of that theory. In this manner, the performance of the implementation can be compared with
human performance. If the algorithm differs strongly from known human performance, its suitability as a
biological model is quickly brought into question (cf. the cooperative algorithm of Marr and Poggio (1976)).

This article describes an implementation of the Marr-Poggio stereo theory, written with particular
emphasis on the matching process (Grimson and Marr, 1979). For details of the derivation and justification of the
theory, see Marr and Poggio (1979).

The first part of this paper describes the overall design of the implementation. Several examples of the
implementation's performance on different images are then discussed, including random dot stereograms from
the human stereopsis literature, such as those with one image defocussed, those with noise introduced into
part of the images' spectra, and so forth. It is shown that the implementation behaves in a manner similar to
humans on these special cases. Thirdly, the theory makes some statistical assumptions; these are compared with
the actual statistics found in practice. Next, some points about the theory that were clarified as a result of
writing the program are discussed. Finally, the results of running the program on some natural images are shown.
2. Design of the program

The implementation is divided into five modules, roughly corresponding to the five steps in the summary
above. These modules, and the flow of information between them, are illustrated in Figure 1. Each of the
components is described in turn.

2.1 Input 

There are two aspects of the human stereo system, embedded in the Marr-Poggio theory, which must be
made explicit in the input to the algorithm. The first is the position of the eyes with respect to the scene, as eye
movements will be critical for obtaining fine disparity information. The second is the change in resolution of
analysis of the image with increasing eccentricity.

To account for these effects, the algorithm maintains as its initial input a stereo pair of images,
representing the entire scene visible to the viewer. This pair of images corresponds to the environment around the
visual system, rather than some integral part of the system itself. To create this representation of the scene,
natural images were digitized on an Optronix Photoscan System P1000. The sizes of these images are indicated
in the legends. Grey-level resolution is 8 bits, providing 256 intensity levels. For the random dot patterns
illustrated in this article, the images were constructed by computer, rather than digitized from a photograph.

Figure 1. Diagram of the algorithm. The images of the scene are mapped into the images of the retinas, taking
into account the eye positions. Each image is convolved with a set of different sized masks and zero-crossings
are located for each convolution. For each size mask, the left and right zero-crossing descriptions are matched.
These matched descriptions are combined into a single representation. As well, the matches from the larger
channels can drive eye vergence movements, causing new retinal images to be created and allowing the smaller
channels to come into correspondence.

For a given position of the eyes, relative to the scene, a representation of the images on the two retinas is
extracted. The algorithm creates this retinal representation by obtaining a second, smaller pair of images from
the images representing the whole scene. The mapping from the scene images into the retinal images accounts
for the two factors inherent in the Marr-Poggio theory. First, different sections of the scenes will be mapped to
the center (fovea) of the retinal images as the positions of the eyes are varied. Since the matching process will
take place on the arrays representing the retinal images, it is important that the coordinate systems of those
arrays coincide with the current positions of the eyes. Note that the portion of the scene image which is mapped
into the retinal image may differ for the two eyes, depending on the relative positions of the two optical axes.
In particular, there may be differences in vertical alignment as well as in horizontal alignment. Second, the
Marr-Poggio theory also states that the resolution of the earlier stages of the algorithm (the convolution and
zero-crossings) scales linearly with eccentricity. The most convenient method for dealing with this fact is to
account for the scaling with eccentricity at the level of the extraction of the images. This means that rather
than extracting a set of retinal images in a linear manner, we may map the scene into the retinal images by
a mapping whose magnification varies with eccentricity. By so doing, the later stages of processing need not
explicitly account for the variation with eccentricity. Rather, these processes are considered as operating on a
uniform grid. Note that this eccentric mapping is not essential, especially for small images. In most of the cases
illustrated in this article, the mapping was not used.
After the completion of this stage, the implementation has created a representation of the images that
has accounted for eye position and for retinal scaling with eccentricity. For each pass of the algorithm,
the matching will take place on the representation of the retinal images, thereby implicitly assuming some
particular eye positions. Once the matching has been completed, the disparity values obtained may be used to
change the positions of the two optic axes, thus causing a new pair of retinal images to be extracted from the
representations of the scene, and the matching process may proceed again.
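
As a rough illustration of the input stage, the following Python sketch extracts a pair of retinal images from a pair of scene images for a given eye position. It covers only the uniform (non-eccentric) mapping. The function names, the representation of an image as a list of rows, and the use of a single horizontal `vergence` offset for the right eye are assumptions made for this example, not details taken from the implementation.

```python
def extract_retinal_image(scene, centre, half_size):
    """Cut a square window of side 2*half_size+1 around centre = (x, y).

    scene is a list of rows of grey levels. The window must lie entirely
    inside the scene (handling of borders is left out of this sketch).
    """
    cx, cy = centre
    height, width = len(scene), len(scene[0])
    inside_x = half_size <= cx < width - half_size
    inside_y = half_size <= cy < height - half_size
    if not (inside_x and inside_y):
        raise ValueError("retinal window falls outside the scene image")
    return [row[cx - half_size:cx + half_size + 1]
            for row in scene[cy - half_size:cy + half_size + 1]]


def extract_retinal_pair(left_scene, right_scene, fixation, vergence, half_size):
    """Extract left and right retinal images for one eye position.

    fixation is the (x, y) scene point imaged at the centre of the left
    retina; vergence is the assumed horizontal offset of the right optical
    axis relative to the left one, in image elements.
    """
    fx, fy = fixation
    left = extract_retinal_image(left_scene, (fx, fy), half_size)
    right = extract_retinal_image(right_scene, (fx + vergence, fy), half_size)
    return left, right
```

Changing `vergence` and extracting again plays the role of a vergence movement: the later stages then operate on new retinal arrays without needing to know the eye positions.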

2.2 Convolution 

Given the retinal representations of the images, it is then necessary to transform them into a form upon
which the matcher may operate. Marr and Poggio (1979) argued that the items to be matched in an image
must be in one-to-one correspondence with well-defined locations on a physical surface. This led to the use of
image predicates which correspond to changes in intensity. Since these intensity changes can occur over a wide
range of scales within a natural image, they are detected separately at different scales. This is in agreement with
the findings of Campbell and Robson (1968), who showed that visual information is processed in parallel by
a number of independent spatial-frequency-tuned channels, and with the findings of Julesz and Miller (1975)
and Mayhew and Frisby (1976), who showed that spatial-frequency-tuned channels are used in stereopsis and
are independent. Recent work by Wilson and Bergen (1979) and Wilson and Giese (1977) provided evidence
for the particular form of these spatial-frequency-tuned operators. Measuring contrast sensitivity to vertical
line stimuli, Wilson and his collaborators showed that the image is convolved with an operator which in one
dimension may be closely approximated by a difference of two gaussian functions (DOG).

In the original theory (Marr and Poggio, 1979), the proposed masks were oriented bar masks whose
cross-section was a difference of two gaussians, as given by the Wilson and Bergen data. If an intensity change
occurs along a particular orientation in the image, there will be a peak in the first directional derivative of
intensity, and a zero-crossing in the second directional derivative. Thus, the intensity changes in the image
can be located by finding zero-crossings in the output of a second directional derivative operator. However,
a number of practical considerations have led Marr and Hildreth (1979) to suggest that the initial operators
not be directional operators. The only non-directional linear second derivative operator is the Laplacian. Marr
and Hildreth have shown that provided two simple conditions on the intensity function in the neighbourhood
of an edge are satisfied, the zero-crossings of the second directional derivative taken perpendicular to an edge
will coincide with the zero-crossings of the Laplacian along that edge. Therefore, theoretically, we can detect
intensity changes occurring at all orientations using the single non-oriented Laplacian operator. Thus, Marr and
Hildreth propose that intensity changes occurring at a particular scale may be detected by locating the
zero-crossings in the output of ∇²G, the Laplacian of a gaussian distribution. The operator, together with its Fourier
transform, is illustrated in Figure 2. The form of the operator is given by:

$$\nabla^2 G(r,\theta) = \left[\frac{r^2 - 2\sigma^2}{\sigma^4}\right] \exp\left\{\frac{-r^2}{2\sigma^2}\right\}.$$

Given the form of the operators, it is only left to determine the size of these masks. To do this, we
first note that Marr and Hildreth (1979) showed that the operator ∇²G is a close approximation to the DOG
function. Wilson and Bergen's data indicated DOG filters whose sizes, specified by the width w of the filter's
central excitatory region, range from 3.1' to 21' of visual arc. The variable w is related to the constant σ of
∇²G by the relation:

$$\sigma = \frac{w}{2\sqrt{2}}.$$
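
As a minimal sketch of how a mask could be sampled from a given width w, the following Python uses σ = w / (2√2) and evaluates the Laplacian of a gaussian on a square grid. The function names, the default truncation radius of about 1.8w, and the sign flip that makes the central region positive (so that it reads as "excitatory") are assumptions made for illustration; the implementation described here may differ in all three.

```python
import math


def sigma_from_width(w):
    """Space constant sigma for a central-region width w: sigma = w / (2*sqrt(2))."""
    return w / (2.0 * math.sqrt(2.0))


def log_value(r, sigma):
    """Negated Laplacian of a gaussian at radius r.

    The negation (an assumption of this sketch) gives a positive centre
    and a negative surround; the zero lies at r = sqrt(2) * sigma = w / 2.
    """
    s2 = sigma * sigma
    return -((r * r - 2.0 * s2) / (s2 * s2)) * math.exp(-(r * r) / (2.0 * s2))


def log_mask(w, radius=None):
    """Sample the operator on a (2*radius + 1) by (2*radius + 1) grid."""
    sigma = sigma_from_width(w)
    if radius is None:
        radius = int(math.ceil(1.8 * w))  # assumed truncation radius
    return [[log_value(math.hypot(x, y), sigma)
             for x in range(-radius, radius + 1)]
            for y in range(-radius, radius + 1)]
```

With this parametrisation the diameter of the positive central region of the sampled mask equals w, which is the property the relation between w and σ is meant to capture.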

Wilson and Bergen's values were obtained by using oriented line stimuli. To obtain the diameter of the
corresponding circularly symmetric center-surround receptive field, the values of w must be multiplied by
√2. Finally, we want the resolution of the initial images to roughly represent the resolution of processing by
the cones, and the size of the filters to represent the size of the retinal operators. In the most densely packed
region of the human fovea, the center-to-center spacing of the cones is 2.0 to 2.3 μm, corresponding to an
angular spacing of 25 to 29 arc seconds (O'Brien, 1951). Accounting for the conversion of Wilson and Bergen's
data, and using the figure of 27 seconds of arc for the separation of cones in the fovea, one arrives at values of
w in the range 9 to 63 image elements, and hence, values of σ in the range 3 to 23 image elements.

Recently, it has been proposed (Marr, Poggio and Hildreth, 1979) that a further, smaller channel may be 
present. This channel would have a central excitatory width of w = 1.5', roughly corresponding to 4 image 
elements. 
Figure 2. The operators G'' and ∇²G. The top left figure shows G'', the second derivative of a one-dimensional
gaussian distribution. The top right figure shows ∇²G, its rotationally symmetric two-dimensional counterpart.
The bottom figures show their Fourier transforms.
The present implementation uses four filters, each of which is a radially symmetric difference of
gaussians, with w values of 4, 9, 17 and 35 image elements. The coefficients of the filters were represented to
a precision of 1 part in 2048. Coefficients of less than 1/2048th of the maximum value of the mask were set to
zero. Thus, the truncation radius of the mask (the point at which all further mask values were treated as zero)
was approximately 1.8w, or equivalently, 0.68σ.

The actual convolutions were performed on a LISP machine constructed at the MIT Artificial Intelligence
Laboratory, using additional hardware specially designed for the purpose (Knight, et al. 1979). Figures 3 and 4
illustrate some images and their convolutions with various sized masks.

After the completion of this stage of the algorithm, one has four filtered copies of each of the images, each 
copy having been convolved with a different size mask. 
2.3 Detection and description of zero-crossings

According to the Marr-Poggio theory, the elements that are matched between images are (i)
zero-crossings whose orientations are not horizontal, and (ii) terminations. The exact definition and hence the
detection of terminations is at present uncertain; as a consequence, only zero-crossings are used as input to the
matcher.

Since, for the purpose of obtaining disparity information, we may ignore horizontally oriented segments,
the detection of zero-crossings can be accomplished by scanning the convolved image horizontally for adjacent
elements of opposite sign, or for three horizontally adjacent elements, the middle one of which is zero, the
other two containing convolution values of opposite sign. This gives the position of zero-crossings to within an
image element.
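
The horizontal scan described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code of the implementation: the function names are hypothetical, a crossing between two adjacent elements is reported at the left element, and the sign is labelled +1 for a negative-to-positive change (moving left to right) and -1 for positive-to-negative.

```python
def zero_crossings_in_row(row):
    """Return a list of (x, sign) pairs for one horizontal scan line."""
    found = []
    n = len(row)
    for x in range(n - 1):
        a, b = row[x], row[x + 1]
        if a * b < 0:
            # Two adjacent elements of opposite sign.
            found.append((x, 1 if b > 0 else -1))
        elif b == 0 and x + 2 < n and a * row[x + 2] < 0:
            # Three adjacent elements, the middle one zero, the outer
            # two of opposite sign: report the middle element.
            found.append((x + 1, 1 if row[x + 2] > 0 else -1))
    return found


def zero_crossings(convolved):
    """Scan every row of a convolved image (a list of rows)."""
    return [zero_crossings_in_row(row) for row in convolved]
```

Because only horizontal neighbours are compared, a purely horizontal zero-crossing contour produces no entries, which matches the decision to ignore horizontally oriented segments.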

In addition to their location, we record the sign of the zero-crossings (whether convolution values change
from positive to negative or negative to positive as we move from left to right) and a rough estimate of the
local, two-dimensional orientation of pieces of the zero-crossing contour. In the present implementation, the
orientation at a point on a zero-crossing segment is computed as the direction of the gradient of the convolution
values across that segment, and recorded in increments of 30 degrees. Figures 5 and 6 illustrate zero-crossings
obtained in this way from the convolutions of Figures 3 and 4. Positive zero-crossings are shown
white, and negative crossings, black.

Figure 3. Examples of convolutions with ∇²G. The top figure shows a natural image. The bottom figures show
the convolution of this image with a set of ∇²G operators. The sizes of these operators are w = 36, 18, 9 and
4 image elements.

Figure 4. Examples of convolutions with ∇²G. The top figure shows a random dot pattern. The bottom
figures show the convolution of this image with a set of ∇²G operators. The sizes of these operators are
w = 36, 18, 9 and 4 image elements.

Stereo Implementation, E. Grimson

We compute this zero-crossing description for each image and for each size of mask. 
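
One way to obtain an orientation estimate in 30 degree increments is sketched below in Python. The gradient is approximated by central differences, and the resulting direction is folded into the range 0 to 180 degrees so that it is independent of the sign of the crossing. Both choices, and the function name, are assumptions of this sketch; the text only states that the gradient direction is used and recorded in 30 degree steps.

```python
import math


def quantized_orientation(conv, x, y, step=30):
    """Gradient direction of conv at interior point (x, y), in degrees.

    conv is a list of rows of convolution values. The result is one of
    0, step, 2*step, ... below 180.
    """
    gx = (conv[y][x + 1] - conv[y][x - 1]) / 2.0
    gy = (conv[y + 1][x] - conv[y - 1][x]) / 2.0
    angle = math.degrees(math.atan2(gy, gx)) % 180.0
    return (int(round(angle / step)) * step) % 180
```

The final modulo maps a direction that rounds up to 180 degrees back to 0, so nearly horizontal gradients on either side of the axis receive the same label.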

2.4 Matching 

The matcher implements the second of the matching algorithms described by Marr and Poggio (1979, 
p.315). For each size of filter, matching consists of 6 steps: 

(1) Fix the eye positions. 

(2) Locate a zero-crossing in one image. 

(3) Divide the region about the corresponding point in the second image into three pools.

(4) Assign a match to the zero-crossing based on the potential matches within the pools. 

(5) Disambiguate any ambiguous matches. 

(6) Assign the disparity values to a buffer. 

These steps may be repeated several times during the fusion of an image. Given a position for the optic
axes, these matching steps are performed, with the results stored in a buffer. These results may be used to
refine the eye positions, causing a new set of retinal images to be extracted from the scene, and the matching
steps are performed again.

We now expand upon each of the six steps of the matching process. The first step consists of fixing the
two eye positions. The alignment between the two zero-crossing descriptions, corresponding to the positions
of the optical axes, is determined in two ways. The initial offsets of the descriptions are arbitrarily set to zero.
Thereafter, the offsets of the two optical axes are determined by accessing the current disparity values for
a region and using these values to adjust the vergence of the eyes. In this implementation, this is done by
modifying the extraction of the retinal images from the images of the entire scene, accounting for the positions
of the optical axes.
Figure 5. Examples of zero-crossing descriptions. The top figure shows a natural image. The bottom figures
show the zero-crossings obtained from the convolutions of Figure 3. The white lines mark positive
zero-crossings and the black lines, negative ones.

Figure 6. Examples of zero-crossing descriptions. The top figure shows a random dot pattern. The bottom
figures show the zero-crossings obtained from the convolutions of Figure 4. The white lines mark positive
zero-crossings and the black lines, negative ones.
Once the eye positions have been fixed, and the retinal images extracted, the images are convolved with
the DOG filters, and the zero-crossing descriptions are extracted from the convolved images. For a
zero-crossing description corresponding to a particular mask size, the matching is performed by locating a
zero-crossing and executing the following operation. Given the location of a zero-crossing in one image, a
horizontal region about the same location in the other image is partitioned into three pools. These pools form the
region to be searched for a possible matching zero-crossing and consist of two larger convergent and divergent
regions, and a smaller one lying centrally between them. Together these pools span a disparity range equal to
2w, where w is the width of the central excitatory region of the corresponding two-dimensional convolution
mask.
The following criteria are used for matching zero-crossings in the left and right filtered images, for each
pool:

(1) the zero-crossings must come from convolutions with the same size mask.

(2) the zero-crossings must have the same sign.

(3) the zero-crossing segments must have roughly the same orientation.

A match is assigned on the basis of the number of pools containing a matching zero-crossing. If exactly
one zero-crossing of the appropriate sign and orientation (within 30 degrees) is found within a pool, the
location of that crossing is transmitted to the matcher. If two candidate zero-crossings are found within one
pool (an unlikely event), the matcher is notified and no attempt is made to assign a match for the point in
question. If the matcher finds a single crossing in only one of the three pools, that match is accepted, and the
disparity associated with the match is recorded in a buffer. If two or three of the pools contain a candidate
match, the algorithm records that information for future disambiguation.
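
The pool structure and the assignment rule can be sketched in Python for a single zero-crossing and one mask size. Several details here are assumptions introduced for the example and are not given in the text: the half-width `central` of the middle pool, the convention that positive disparities fall in the convergent pool, the representation of a crossing as a tuple `(x, sign, orientation)`, and the status labels returned.

```python
def pool_of(d, w, central):
    """Name the pool containing disparity d, or None if |d| > w."""
    if abs(d) > w:
        return None
    if abs(d) <= central:
        return "zero"
    return "convergent" if d > 0 else "divergent"


def orientations_agree(a, b, tol=30):
    """True if two orientations (degrees, modulo 180) differ by at most tol."""
    diff = abs(a - b) % 180
    return min(diff, 180 - diff) <= tol


def match_crossing(left, right_row, w, central=1):
    """Try to match one left-image crossing against a right-image scan line.

    Both images are assumed to have been filtered with the same mask size.
    Returns (status, value), where status is one of "unmatched", "matched"
    (value is the disparity), "ambiguous" (value maps pool name to
    disparity) or "blocked" (two candidates fell in one pool).
    """
    x, sign, orient = left
    pools = {"divergent": [], "zero": [], "convergent": []}
    for rx, rsign, rorient in right_row:
        name = pool_of(rx - x, w, central)
        if name and rsign == sign and orientations_agree(orient, rorient):
            pools[name].append(rx - x)
    if any(len(found) > 1 for found in pools.values()):
        return ("blocked", None)
    candidates = {name: found[0] for name, found in pools.items() if found}
    if not candidates:
        return ("unmatched", None)
    if len(candidates) == 1:
        return ("matched", next(iter(candidates.values())))
    return ("ambiguous", candidates)
```

Keeping the ambiguous case as a mapping from pool name to disparity preserves exactly the information that a later disambiguation step needs: which disparity sign each candidate belongs to.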

Once all possible unambiguous matches have been identified, an attempt is made to disambiguate double
or triple matches. This is done by scanning a neighbourhood about the point in question, and recording the
disparity sign of the unambiguous matches within that neighbourhood. (Disparity sign refers to the sign of
the pool from which the match comes: divergent, convergent or zero.) If the ambiguous point has a potential
match of the same disparity sign as the dominant type within the neighbourhood, then that is chosen as the
match (this is the "pulling" effect). Otherwise, the match at that point is left ambiguous.
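
The disambiguation rule might be sketched in Python as below. The representation of the ambiguous candidates as a mapping from pool name to disparity, the function name, and the treatment of a tie between disparity signs as "no dominant type" are assumptions of this sketch.

```python
from collections import Counter


def disambiguate(candidates, neighbour_pools):
    """Pick a candidate using the dominant disparity sign nearby.

    candidates maps a pool name ("divergent", "zero", "convergent") to the
    disparity of the candidate match in that pool. neighbour_pools lists the
    pool names of the unambiguous matches in the neighbourhood. Returns the
    chosen disparity, or None if the point stays ambiguous.
    """
    counts = Counter(neighbour_pools).most_common(2)
    if not counts:
        return None  # no unambiguous neighbours to pull from
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None  # tie: no dominant disparity sign (assumed behaviour)
    dominant = counts[0][0]
    return candidates.get(dominant)
```

For example, a point with candidates in the divergent and convergent pools, surrounded mostly by convergent matches, is pulled to its convergent candidate.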

There is the possibility that the region under consideration does not lie within the ±w disparity range
handled by the matcher. This situation is detected and handled by the following operation. Consider the case
in which the region does lie within the disparity range ±w. Excluding the case of occluded points, every
zero-crossing in the region will have at least one candidate match (the correct one) in the other filtered image. On
the other hand, if the region lies beyond the disparity range ±w, then the probability of a given zero-crossing
having at least one candidate match will be less than 1. In fact, Marr and Poggio show that the probability of
a zero-crossing having at least one candidate match in this case is roughly 0.7. We can perform the following
operation in this case. For a given eye position, the matching algorithm is run for all the zero-crossings. Any
crossing for which there is no match is marked as such. If the fraction of matched points in any region is
less than a threshold of 0.7 then the region is declared to be out of range, and no disparity values are accepted
for that region.
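
The out-of-range test amounts to a simple threshold on the fraction of zero-crossings that found at least one candidate match. A minimal Python sketch follows; the function name and the decision to treat an empty region as out of range are assumptions.

```python
def region_in_range(has_candidate, threshold=0.7):
    """Decide whether a region lies within the matchable disparity range.

    has_candidate is a list of booleans, one per zero-crossing in the
    region, True when the crossing has at least one candidate match.
    The region is in range when the matched fraction is not below the
    threshold.
    """
    if not has_candidate:
        return False  # assumed: a region with no crossings yields no values
    matched = sum(1 for flag in has_candidate if flag)
    return matched / len(has_candidate) >= threshold
```

A region that fails this test contributes no disparity values for the current eye position and mask size.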

The overall effect of the matching process, as driven from the left image, is to assign disparity values to
most of the zero-crossings obtained from the left image. An example of the output appears in Figure 7. In
this array, a zero-crossing at position (x, y) with associated disparity d has been placed in a three-dimensional
array with coordinate (x, y, d). For display purposes, the array is shown in the figures as viewed from a point
some distance away. The heights in the figure correspond to the assigned disparities.

After completion of this stage of the implementation, we have obtained a disparity array for each mask
size. The disparity values are located only along the zero-crossing contours obtained from that mask.






2.5 Vergence Control 

The Marr-Poggio theory states that in order to obtain fine resolution disparity information, it is necessary
that the smallest channels obtain a matching. Since the range of disparity over which a channel can obtain
a match is directly proportional to the size of the channel, this means that the positions of the eyes must
be assigned appropriately to ensure that the corresponding zero-crossing descriptions from the two images
are within a matchable range. The disparity information required to bring the smallest channels into their
matchable range is provided by the larger channels. That is, if a region of the image is declared to be out of
range of fusion by the smaller channels, one can frequently obtain a rough disparity value for that region from
the larger channels, and use this to verge the eyes. In this way, the smaller channels can be brought into a
range of correspondence.

Figure 7. Results of the algorithm. The top stereo pair is an image of a painted coffee jar. The next two figures
show two orthographic views of the disparity map. The disparities are displayed as {x, y, c − αd(x, y)},
where c is a constant and d(x, y) is the difference in the location of a zero-crossing in the right and left images.
For purposes of illustration, α has been adjusted to enhance the features of the disparity map. The left view
of the disparity map shows the jar as viewed from the lower edge of the image, and the right view shows the
jar as viewed from the left edge of the image. Note that the background plane appears tilted in the disparity
map. This agrees with the fused perception. The second stereo pair is a 50% density random dot pattern.
The bottom figure shows the disparity map as viewed orthographically from some distance away. All disparity
maps are those obtained from the w = 4 channel.

Thus, after the disparities from the different channels have been combined, there is a mechanism for
controlling vergence movements of the eyes. This operates by searching for regions of the image which do
not have disparity values for the smallest channel, but which do have disparity values for the larger channels.
These large channel values are used to provide a refinement to the current eye positions, thereby bringing the
smaller channels into range of correspondence. Two possible mechanisms for extracting the disparity value
from a region of the image include using the peak value of a histogram of the disparities in that
neighbourhood, or using a local average of the disparity values. In the current implementation, the search for such a
region proceeds outwards from the fovea.
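
The two mechanisms for extracting a vergence value, together with the outward search from the fovea, can be sketched in Python as follows. The data layout (a dictionary from region centres to the disparities found by the smallest and by a larger channel), the function names, the rounding of the average, and the rule that a tie between histogram peaks goes to the disparity of smallest magnitude are all assumptions made for the example.

```python
import math
from collections import Counter


def vergence_from_histogram(disparities):
    """Peak of the disparity histogram; ties go to the smallest magnitude."""
    counts = Counter(disparities)
    best = max(counts.values())
    return min((d for d, c in counts.items() if c == best), key=abs)


def vergence_from_average(disparities):
    """Local average of the disparities, rounded to a whole image element."""
    return int(round(sum(disparities) / len(disparities)))


def next_vergence_target(regions, fovea, estimate=vergence_from_histogram):
    """Find the region nearest the fovea that needs a vergence movement.

    regions maps a region centre (x, y) to a pair (small, large): the
    disparities found there by the smallest channel and by a larger one.
    Returns (centre, vergence_value), or None if no region qualifies.
    """
    def distance(centre):
        return math.hypot(centre[0] - fovea[0], centre[1] - fovea[1])

    for centre in sorted(regions, key=distance):
        small, large = regions[centre]
        if not small and large:
            return centre, estimate(large)
    return None
```

The histogram peak is less sensitive to a few stray matches than the average, which may matter when a region straddles a depth discontinuity.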

It should be noted here that although the use of disparity information from coarser channels to drive
eye movements, allowing smaller channels to come into correspondence, is a necessary condition of the
Marr-Poggio theory, it is not necessarily the only such condition. In other words, there may be other modules of
the visual system which can initiate eye movements, and thereby affect the input to the matching component,
by altering the retinal images presented to the matcher. An example of this would be the evidence of Kidd
et al. (1979) concerning the ability of texture contours to facilitate stereopsis by initiating eye movements.
However, such effects are somewhat orthogonal to the question of the sufficiency of the matching component
of the Marr-Poggio theory, since they affect the input to the matcher, but not the actual performance of the
matching algorithm itself.

2.6 The 2½-D sketch

Once the separate channels have performed their matching, the results are combined and stored in a
buffer, called the 2½-D sketch. There are several possible methods for accomplishing this. As far as the
Marr-Poggio theory is concerned, the important point is that some type of storage of disparity information occurs.
(Perhaps the strongest argument for this is the fact that up to 2 degrees of disparity can be held fused in the
fovea.)

We shall outline two different possibilities for the combination of the different channels. The method
currently used in the implementation will be described below. A more biologically feasible method will be
outlined in the discussion.

One of the critical questions concerning the form of the 2½-D sketch is whether it reflects the scene or the
retinal images. For all the cases illustrated in this article, the sketch was constructed by directly relating the
coordinates of the sketch to the coordinates of the images of the entire scene. That is, as disparity information
was obtained, it was stored in a buffer at the position corresponding to the position in the original scene from
which the underlying zero-crossing came. Since disparity information about the scene is extracted from several
eye positions, in order to store this information into a buffer, explicit information about the positions of the
eyes is required. It will be argued in the discussion that this is probably inappropriate as a model of the
human system. However, for the purposes of demonstrating the effectiveness of the matching module, such a
representation is sufficient.

The actual mechanism for storing the disparity values requires some combination of the disparity maps
obtained for each of the channels. Currently, the sketch is updated, for each region of the image, by
writing in the disparity values from the smallest channel which is within range of fusion. Vergence
movements are possible in order to bring smaller channels into a range of matching for some region.
Further, for those regions of the image for which none of the channels can find matches, modification
of the eye positions over a scale larger than that of the vergence movements is possible. By this
method, one can attempt to bring those regions of the image into a range of fusion.

There are several possibilities for the actual method of driving the vergence movements. Two of these
were outlined in the previous section.
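The update rule described above, in which the sketch receives the disparity from the smallest channel
that is within range of fusion, might be sketched as follows. This is a simplified illustration: the
name `update_sketch`, the dictionary representation, and the choice to work point by point rather than
region by region are assumptions of the sketch, not details of the implementation.

```python
def update_sketch(sketch, channel_maps):
    """Write into `sketch` the disparity from the smallest channel that matched.

    sketch: dict mapping (x, y) -> disparity, updated in place.
    channel_maps: dict mapping mask width w -> {(x, y): disparity}, holding
    only the matches that were accepted for that channel.
    """
    for w in sorted(channel_maps, reverse=True):  # coarsest channel first
        # Finer channels are written later, so they overwrite coarser values.
        sketch.update(channel_maps[w])
    return sketch
```

Positions matched only by a coarse channel keep the coarse value until a vergence movement brings a
finer channel into range.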

The final output of the algorithm consists of a representation of disparity values in the image, those
disparities being restricted to positions in the image lying along zero-crossing segments.



2.7 Summary of the process 

The complete algorithm, as currently implemented, uses four mask sizes. Initially, the two views of the
scene are mapped into a pair of retinal images. These images are convolved with each mask. The
zero-crossings and their orientation are computed, for each channel and each view. The initial
alignments of the eyes determine the registration of the images. The matching of the descriptions from
each channel is performed for this alignment. Any points with either ambiguous matchings or with no
match are marked as such.
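For concreteness, the detection of signed zero-crossings along one horizontal line of a convolved
image can be sketched as below. The function name `zero_crossings` and the linear interpolation used
to place each crossing between samples are assumptions of this sketch; the orientation computation is
omitted, and samples that are exactly zero are ignored for brevity.

```python
def zero_crossings(row):
    """Return (position, sign) pairs for the sign changes along one scan line.

    row: list of filtered (convolved) values along a horizontal line.
    sign is +1 for a negative-to-positive crossing and -1 for the reverse.
    """
    crossings = []
    for i in range(len(row) - 1):
        a, b = row[i], row[i + 1]
        if a * b < 0:
            # Linear interpolation places the crossing between samples i and i + 1.
            position = i + a / (a - b)
            crossings.append((position, 1 if b > 0 else -1))
    return crossings
```

Because the position is interpolated, a crossing can be located more finely than the sample spacing
of the image array.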

Next, the percentage of unmatched points is checked, for all square neighbourhoods of a particular
size. This size is chosen so as to ensure that the measurement of the statistics of matching within
that neighbourhood is statistically sound. Only the disparity points of those regions whose percentage
of unmatched points is below a certain threshold, determined by the statistical analysis of Marr and
Poggio (1979), are allowed to remain. All other points are removed. The values which are kept are
stored into a buffer. At this stage, vergence movements may take place, using information from the
larger channels to bring the smaller channels into a range where matching is possible. Further, if
there are regions of the image which do not have disparity values at any level of channel, an eye
movement may take place in an attempt to bring those portions of the image into a range where at least
the largest mask can perform its matching.
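The neighbourhood check can be illustrated with the following sketch. It makes several simplifying
assumptions that are not taken from the implementation: the neighbourhoods are non-overlapping square
blocks, the zero-crossing points are held in a dictionary, and the default threshold of 0.3 is the
approximate value quoted later in the text from the analysis of Marr and Poggio (1979). The name
`filter_regions` is hypothetical.

```python
def filter_regions(points, size, threshold=0.3):
    """Keep disparities only in blocks whose unmatched fraction is below threshold.

    points: dict mapping (x, y) -> disparity, or None for a zero-crossing
    that is unmatched (or ambiguous) in the current alignment.
    size: side of the square blocks used as neighbourhoods (an assumption).
    """
    blocks = {}
    for (x, y), d in points.items():
        blocks.setdefault((x // size, y // size), []).append(((x, y), d))
    kept = {}
    for members in blocks.values():
        unmatched = sum(1 for _, d in members if d is None)
        if unmatched / len(members) < threshold:
            # The block is taken to be within range of fusion: keep its matches.
            kept.update({p: d for p, d in members if d is not None})
    return kept
```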

Note that the matching process takes place independently for each of the four channels. Once the
matching of each channel is complete, the results are combined into a single representation of the
disparities.

The final output is thus a disparity map, with disparities assigned along most portions of the
zero-crossing contours obtained from the smallest masks. The accuracy of the disparities thus obtained
depends on how accurately the zero-crossings have been localized, which may, of course, be to a
resolution much finer than the initial array of intensity values that constitutes the image.

3. Examples and Assessment of Performance

A standard tool in the examination of human stereo perception is the random dot stereogram (Julesz,
1960, 1971). This is a pair of stereo images where each image, when viewed monocularly, consists only
of randomly distributed dots, yet when viewed stereoscopically, may be fused to yield patterns
separated in depth. Such patterns are a useful tool for analysing the stereo component of the human
visual system, since there are no visual cues other than the stereoscopic ones. We can test the
sufficiency of the algorithm by comparing human perception with the performance of the algorithm on
such patterns. As well, since random dot stereograms have well demarked disparity values, it is easy
to assess the correctness of the algorithm's performance on such patterns.

Table 1 lists some of the matching statistics for various random dot patterns. These are illustrated in
Figures 8-13 and discussed below.

[Table 1. Matching statistics for the random dot patterns, by pattern and dot density.]

The first pattern consisted of a central square separated in depth from a second plane. The pattern had
a dot density of 50% and its analysis is shown in Figure 7. Each dot was a square with four image
elements on a side. For the algorithm, this corresponds to a dot of approximately two minutes of visual
arc. The total pattern was 320 image elements on a side. The central plane of the figure was shifted 12
image elements in one image relative to the other. The final disparity map assigned after the matching
of the smallest channel had the following statistics. The number of zero-crossing points in the left
description which were assigned a disparity was 11847. Of these 11847, 11830 were disparity values
which were exactly correct, and an additional 14 deviated by one image element from the correct value.
Approximately 0.03% of the matched points, or roughly 3 points in 10000, were incorrectly matched.
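A pattern with the dimensions just described (320 image elements on a side, dots of four image
elements, a 12 element shift, 50% density) could be generated along the following lines. This is only
a sketch: the size and placement of the central square, the direction of the shift, the refilling of
the uncovered strip with fresh random dots, and the name `random_dot_stereogram` are all assumptions,
since the text does not specify them.

```python
import random

def _expand(img, dot):
    """Blow each dot up into a dot-by-dot block of image elements."""
    return [[v for v in row for _ in range(dot)] for row in img for _ in range(dot)]

def random_dot_stereogram(side=320, dot=4, shift=12, density=0.5, seed=0):
    """Return (left, right) binary images as lists of rows."""
    rng = random.Random(seed)
    n = side // dot   # number of dots per side
    s = shift // dot  # shift measured in dots
    left = [[1 if rng.random() < density else 0 for _ in range(n)] for _ in range(n)]
    right = [row[:] for row in left]
    lo, hi = n // 4, 3 * n // 4  # central square (its size is an assumption)
    for y in range(lo, hi):
        for x in range(lo, hi):
            right[y][x - s] = left[y][x]  # copy the square, displaced by s dots
        for x in range(hi - s, hi):
            # Refill the strip uncovered by the displaced square.
            right[y][x] = 1 if rng.random() < density else 0
    return _expand(left, dot), _expand(right, dot)
```

Every point of the central square then has a disparity of exactly 12 image elements and every
background point a disparity of zero, which is what makes the correctness of a match easy to score.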

A similar test was run on patterns with a dot density of 25%, 10% and 5%. The results are illustrated
in Figure 8.

For each of these cases, the number of incorrectly matched points was extremely low. Those points
which were assigned incorrect disparities all occurred at the border between the two planes, that is,
along the discontinuity in disparity.

A more complex random dot pattern consisted of a wedding cake, built from four different planar
layers, each separated by 8 image elements, or 2 dot widths. This is illustrated in Figure 9.

In this case, the number of zero-crossing points assigned a disparity was 11102. Of these points, 11095
were assigned a disparity value which was exactly correct, and an additional 6 1 deviated from the
correct value by one image element. Approximately 0G% of the points were incorrectly matched. Again,
these incorrect points all occurred at the boundaries between the planes. A second complex pattern is
illustrated in Figure 9. The object is a spiral with a range of continuously varying disparities.

Figure 8. The top stereo pair is a 25% density random dot pattern. The disparity map below it is
displayed as in Figure 7. The bottom stereo pair is a 5% density random dot pattern. Its disparity map
is shown below it. Both disparity maps are obtained from the w = 4 channel.

Figure 9. The top stereo pair is a 50% density wedding cake, composed of four planar levels. The
disparity map is shown below it. The bottom stereo pair is a 50% spiral. The disparity map is shown
below it, in a manner similar to Figure 7. Both disparity maps are obtained from the w = 4 channel.

There are a number of special cases of random dot patterns which have been used to test various aspects
of the human visual system. The algorithm was also tested on several of these stereograms. They are
outlined below and a comparison between the performance of the algorithm and human perception is given.

It is known that if one or both of the images of a random dot stereogram are blurred, fusion of the
stereogram is still possible (Julesz 1971, p.96). To test the algorithm in this case, the left half of
a 50% density pattern was blurred by convolution with a gaussian mask. This is illustrated in Figure
10. The disparity values obtained in this case were not as exact as in the case of no blurring. Rather,
there was a distribution of disparities about the known correct values. As a result, the percentage of
points that might be considered incorrect (more than one image element deviation from the correct
value) rose to 6%. However, the qualitative performance of the algorithm is still that of two planes
separated in depth. It is interesting to note that the slight distribution of disparity values about
those corresponding to the original planes is consistent with the human perception of a pair of
slightly warped planes.

Julesz and Miller (1975) showed that fusion is also possible in the presence of some types of masking
noise. In particular, if the spectrum of the noise is disjoint from the spectrum of the pattern, it can
be demonstrated that fusion of the pattern is still possible. Within the framework of the Marr-Poggio
theory, this is equivalent to stating that if one introduces noise of such a spectrum as to interfere
with one of the stereo channels, fusion is still possible among the other channels, provided the noise
does not have a substantial spectral component overlapping other channels as well. This was tested on
the algorithm by high pass filtering a second random dot pattern, to create the noise, and adding the
noise to one image. In the case illustrated in Figures 10 and 11, the spectrum of the noise was
designed to interfere maximally with the smallest channel. In the case shown by HNOISE1 and HNOISE2 in
Table 1, the noise was added such that the maximum magnitude of the noise was equal to the maximum
magnitude of the original image. HNOISE1 illustrates the performance of the smallest channel. HNOISE2
illustrates the performance of the next larger channel. It can be seen that for this case, some fusion
is still possible in the smallest channel, although it is patchy. The next larger channel also obtains
fusion. In both cases, the accuracy of the disparity values is reduced from the normal case. This is to
be expected, since the introduction of noise tends to displace the positions of the zero-crossings. In
the case shown by HNOISE3 and HNOISE4 in Table 1, the noise was added such that the maximum magnitude
was twice that of the maximum magnitude of the original image. Here, matching in the smallest channel
is almost completely eliminated (HNOISE3). Yet matching in the next larger channel is only marginally
affected (HNOISE4).

Figure 10. The top stereo pair is a 50% density pattern in which the left image has been blurred. The
disparity map is shown below it. It can be seen that two planes are still evident, although they are
not as sharply defined as in Figure 7 or Figure 8. The disparity map is that obtained from the w = 4
channel. The bottom stereo pair is a 50% density pattern. The left image has had high pass filtered
noise added to it so that the maximum magnitude of the noise is equal to the maximum magnitude of the
image. The disparity map shown is that obtained by the w = 9 channel.

Figure 11. The top stereo pair is a 50% density pattern. The left image has had high pass filtered
noise added to it so that the maximum magnitude of the noise is half the maximum magnitude of the
image. The top disparity map is that obtained from the w = 9 channel, while the next disparity map is
that obtained from the w = 4 channel. It can be seen that the w = 4 channel obtains a matching only in
a few sections of the image. The bottom stereo pair is a 50% density pattern in which the left image
has been compressed in the horizontal direction. The disparity map from the w = 4 channel is displayed
below. It can be seen that the two planes are still evident, although the entire pattern appears
slanted. This is in agreement with human perception.

The implementation was also tested on the case of adding low pass filtered noise to a random dot
pattern, with results similar to those of adding high pass filtered noise. Here, the larger channels
are unable to obtain a good matching, while the smaller channels are relatively unaffected.
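The noise conditions above are specified by the ratio between the maximum magnitude of the noise and
that of the image. A minimal sketch of that scaling step is given below; it assumes the noise has
already been high or low pass filtered, and the name `add_scaled_noise` and the list-of-rows image
representation are hypothetical.

```python
def add_scaled_noise(image, noise, ratio=1.0):
    """Add noise so that max |noise| equals ratio * max |image|.

    image, noise: equal-sized 2-D lists of numbers; the noise is assumed to
    have been filtered already and to be non-zero somewhere.
    """
    peak_image = max(abs(v) for row in image for v in row)
    peak_noise = max(abs(v) for row in noise for v in row)
    gain = ratio * peak_image / peak_noise
    return [[p + gain * q for p, q in zip(ri, rn)] for ri, rn in zip(image, noise)]
```

A ratio of 1.0 would correspond to the equal-magnitude condition and 2.0 to the condition in which the
noise peak is twice the image peak.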

If one of the images of a random dot pattern is compressed in the horizontal direction, the human
stereo system is still able to achieve fusion (Julesz 1971, p.213). The algorithm was tested on this
case, and the results are shown in Figure 11. It can be seen that the program still obtains a
reasonably good match. The planes are now slightly slanted, which agrees with human perception.

If some of the dots of a pattern are decorrelated, it is still possible for a human observer to achieve
some kind of fusion (Julesz 1971, p.88). Two different types of decorrelation were tested. In the first
type, increasing percentages of the dots in the left image were decorrelated at random. In particular,
the cases of 10%, 20% and 30% were tried, and are illustrated in Figure 12. For the 10% case (table
entry Uncorr1), it can be seen that the algorithm was still able to obtain a good matching of the two
planes, although the total number of zero-crossings assigned a disparity decreased, and the percentage
of incorrectly matched points increased. When the percentage of decorrelated dots was increased to 20%
(table entry Uncorr2), the number of matched points decreased again, although the percentage of those
which were incorrectly matched remained about the same. Finally, when the percentage of decorrelated
dots was increased to 30% (table entry Uncorr3), the algorithm found virtually no section of the image
which could be fused.

The failure of the algorithm to match the 30% decorrelated pattern is caused by the component of the
algorithm which checks that each region of the image is within range of correspondence. Recall that in
order to distinguish between the case of two images beyond range of fusion (for the current eye
positions), which will have only randomly matching zero-crossings, and the case of two images within
range of fusion, the Marr-Poggio theory requires that the percentage of unmatched points is less than
some threshold. This threshold is approximately 0.3, according to the statistical analysis of Marr and
Poggio (1979). For the case of the pattern with 30% decorrelation, on the average, each region of the
image will have roughly 30% of its zero-crossings different and hence the algorithm decides that the
region is out of range of correspondence. Hence, no disparities are accepted for this region.

Figure 12. The top stereo pair is a 50% density pattern in which the left image has had 10% of the dots
decorrelated. The disparity map is shown below. The bottom stereo pair is a 50% density pattern in
which the left image has had 20% of the dots decorrelated. The disparity map is shown below. Note that
in this case there are large regions of the image for which no match was made.

For the algorithm, the computational reason for the failure to process patterns with 30% decorrelation
is that it could not distinguish a correctly matched region of such a pattern from a region which was
out of range of correspondence, but had a random set of matches for many of the points in the region.
It is interesting to note that many human subjects show a similar behavior; that is, some kind of
fusion for up to 20% decorrelation, although the fusion becomes increasingly weaker, and virtually no
fusion for patterns with 30% decorrelation.
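The random decorrelation used in these tests might be produced as in the sketch below. It assumes that
"decorrelating" a dot means replacing it with a freshly drawn random value, so that about half of the
replaced dots coincide with the original by chance; the text does not state the exact procedure, and
the name `decorrelate` is hypothetical.

```python
import random

def decorrelate(image, fraction, seed=0):
    """Replace a randomly chosen `fraction` of the dots by fresh random values.

    image: 2-D list of 0/1 dots. A copy is returned; the input is unchanged.
    """
    rng = random.Random(seed)
    cells = [(y, x) for y in range(len(image)) for x in range(len(image[0]))]
    chosen = rng.sample(cells, round(fraction * len(cells)))
    out = [row[:] for row in image]
    for y, x in chosen:
        out[y][x] = rng.randint(0, 1)  # new value, independent of the original
    return out
```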

One can also decorrelate the pattern by breaking up all white triplets along one set of diagonals, and 
all black triplets along the other set of diagonals (Julesz 1971, p.87). The table entry Uncorrd indicates the 
matching statistics for this case. Again, it can be seen that the program still obtains a good match, as do human 
observers. The performance of the algorithm is illustrated in Figure 13. 



4. Statistics 



A number of parameters are important for the theory, which makes assumptions about them, and they
have been measured on random dot images. The worst cases occur for patterns with a density of 50%, and
for such patterns the worst case values encountered for the parameters have the values shown in Table
2. The theoretical worst case bounds used by Marr and Poggio appear for comparison.

Figure 13. The top stereo pair is a 50% density pattern in which the left image has been diagonally
decorrelated. Along one set of diagonals, every triplet of white dots has been broken by the insertion
of a black dot, and along the other set of diagonals, every triplet of black dots has been broken by
the insertion of a white dot. The disparity map is shown below. The bottom stereo pair is a special
case of Panum's limit. The left image is formed by superimposing two slightly displaced copies of the
right image. The disparity map is shown below, and consists of two superimposed planes.

TABLE OF STATISTICS

parameter                           expected worst   large channel   medium channel   small channel
                                    case behavior    w = 35          w = 17

average distance between            2w               1.51w           1.88w            1.87w
zero-crossings of same sign

probability of candidates in        > .50            .77             .75              .69
at most one pool

probability of candidates in        < .45            .21             .25              .31
two pools

probability of candidates in        < .05            .02             .01              [illegible]
all three pools

given a candidate near zero,        > .9             .33             .85              .8?
probability of no other
candidates

Table 2.

5. Comments and Discussion

Implementing a computational theory offers us the opportunity of testing its adequacy. In this case, I
have found that the performance of the implementation coincides well with that of human subjects over a
broad range of random dot test cases obtained from the literature, including defocussing of,
compression of, and the introduction of various kinds of masking noise to one image of a random dot
stereo pair.

The process of implementing the theory also led to the following observations and refinements of the
theory.

(1) There are a number of questions concerning the form of the 2½-D sketch. The first critical question
concerns whether the sketch reflects the initial or the retinal images. In the first case, the
coordinates of the sketch would be directly related to the coordinates of the images of the entire
scene. However, since disparity information about the scene is extracted from several eye positions,
in order to store this information into a buffer with a coordinate system connected to the image of the
scene, explicit information about the positions of the eyes is required. For the computer
implementation, this is possible, but for a model of the human visual system, it seems unlikely that
such information is available to the stereo process. In the second case, no such problem arises. Here,
the coordinates of the sketch are directly related to the coordinates of the retinal images. Such a
system would be retinocentric, reflecting the current positions of the eyes. This seems to be the most
natural representation.

The second question concerns the use of a fovea. Different sections of the images are analyzed at
different resolutions, for a given position of the optical axes. An important consequence of this is
that the amount of buffer space required to store the disparity will vary widely in the visual field,
being much greater for the fovea than for the periphery. This also suggests the use of a retinocentric
representation, because if one used a frame that had already allowed for eye-movements, it would have
to have foveal resolution everywhere. Not only does such a buffer waste space, but it does not agree
with our own experience as perceivers. If such a buffer were used, we should be able to build up a
perceptual impression of the world that was everywhere as detailed as it is at the centre of the gaze,
and this is clearly not the case.

The final point about the 2½-D sketch is that it is intended as an intermediate representation of the
current scene. It is important for such a representation to pass on its information to higher level
processes as quickly as possible. Thus, it probably cannot wait for a representation to be built up
over several positions of the eyes. Rather, it must be refreshed for each eye position. Thus, a
refinement to the implementation, as outlined above, would be to use a representation that is
retinocentric, and which represents disparities with decreasing resolution as eccentricity increases.

For the cases illustrated in this article, the 2½-D sketch was created by storing fine resolution
disparity values into a scene-centered representation. A second alternative is to store values from
all channels into a retinocentric representation, using disparity values from the smaller channels
where available, and the coarser disparities from the larger channels elsewhere. In this way, a
disparity representation for a single fixation of the eyes may be constructed, with disparity
resolution varying across the retina. Such a method of creating the 2½-D sketch has been tested on the
implementation, with good results.

(2) The neighbourhood over which a search for a matching zero-crossing is conducted is broken into
three pools. In the present implementation, the pools are used to deal with the ambiguous case of two
matching zero-crossings, while the disparity values associated with a match are represented to within
an image element. A second possibility is to use the pools not only to disambiguate multiple matches,
but also to assign a disparity to a match. Thus, a single disparity value, equal to the disparity value
of the midpoint of the pool, would be assigned for a matching zero-crossing lying anywhere within the
pool. In this scheme, only three possible disparities could be assigned to a zero-crossing: zero,
corresponding to the middle pool, or a single positive or negative value, corresponding to the
divergent or convergent pools.

Computer experiments show that either scheme will work. In the case of a single disparity value for
each pool, the disparities assigned by the smallest channel are within an image element of those
obtained using exact disparities for each match. This modification was tried on both natural images and
random dot patterns, and suggests that the accuracy with which the pools represent the match is not a
critical factor.
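The second scheme, in which a match is assigned the disparity of the midpoint of its pool, can be
illustrated as follows. The pool boundaries used here are purely an assumption for the sake of the
example (the search range from -w to +w is divided into three equal pools); the text does not give the
actual boundaries, and the name `pool_disparity` is hypothetical.

```python
def pool_disparity(d, w):
    """Map an exact disparity d, with |d| <= w, to the midpoint of its pool.

    Assumption of this sketch: the search range [-w, +w] is split into three
    equal pools, whose midpoints are -2w/3, 0 and +2w/3.
    """
    if abs(d) > w:
        raise ValueError("disparity outside the search range")
    if abs(d) <= w / 3:
        return 0.0  # middle pool
    return 2 * w / 3 if d > 0 else -2 * w / 3
```

Under this scheme each channel reports only one of three values, so fine disparity resolution would
have to come from the smallest channel after vergence has brought it into range.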

(3) Although the Marr-Poggio matcher is designed to match from one image into the other, there is no
inherent reason why the matching process cannot be driven from both eyes independently. In fact, there
may be some evidence that this is so, as is shown by the following experiment of O. Braddick (1978) on
an extension to Panum's limiting case. First, a sparse random dot pattern was constructed. From this
pattern, a partner was created by displacing the entire pattern by slight amounts to both the left and
the right. Thus, for each dot in the right image, there corresponded two dots in the left image, one
with a small displacement to the left and one with a small displacement to the right. The perception
obtained by viewing such a random dot stereogram is one of two superimposed planes.

Suppose the matching process were only driven from one image, for example, matches were made from
the right image to the left. In this case, the implementation would not be able to account for the
Braddick perception, since all the zero-crossings would have two possible candidates. However, suppose
that the matching process were driven independently from both the right and left images, and an
unambiguous match from either side accepted. In this case, although every zero-crossing in the right
image would have an ambiguous match, the implementation would obtain a unique match for each
zero-crossing in the left image. The implementation was designed to account for matching from either
image.

Braddick's case has been tested on the implementation, and the results are shown in Figure 13. It can
be seen that the results of the implementation are those of two transparent planes.
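The argument can be made concrete with a one-dimensional toy version of Braddick's pattern. The sketch
below is an illustration only: points stand in for zero-crossings, the dot positions, displacement and
search range are invented for the example, and `unique_matches` is a hypothetical helper rather than
part of the implementation.

```python
def unique_matches(source, target, search):
    """Return {s: t - s} for source points with exactly one target within `search`."""
    result = {}
    for s in source:
        candidates = [t for t in target if abs(t - s) <= search]
        if len(candidates) == 1:  # accept only unambiguous matches
            result[s] = candidates[0] - s
    return result

right = [10, 30, 50]  # sparse dots in the right image
# The left image superimposes two copies, displaced by 2 to the left and to the right.
left = sorted([p - 2 for p in right] + [p + 2 for p in right])

from_right = unique_matches(right, left, search=3)  # two candidates for every dot
from_left = unique_matches(left, right, search=3)   # one candidate for every dot
```

Driven from the right image alone, every point is ambiguous and nothing is accepted; driven from the
left image, every point receives one of two disparity values, which corresponds to the two
superimposed planes.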

(4) The points that were incorrectly matched in the test cases all lay along depth discontinuities. The
major reason for this is connected with occlusion of regions. Note that at any depth discontinuity,
there will be an occluded region which is present in one image, but not the other. Any zero-crossings
within that region cannot, of course, have a matching zero-crossing in the other image. However, there
is a certain probability of such a zero-crossing being matched incorrectly to a random zero-crossing in
the other image. In principle, the algorithm detects regions which are occluded, by checking the
statistics of the number of unmatched zero-crossings, and using such results to mark all zero-crossing
matches in the region as unknown. However, for a region which contains a depth discontinuity, only part
of the region will have the above characteristics. Zero-crossings in the rest of the region will have a
unique match. Thus, when the statistical check on the number of unmatched points is performed, it is
possible for the entire region to be considered in range, and thus all matches, including the incorrect
ones of the occluded region, will be accepted.

(5) It is interesting to comment on the effect of depth discontinuities for the different sized masks.
For random dot patterns, the zero-crossings obtained from the larger masks tend to outline blobs or
clusters of dots. Thus in general, the positions of the zero-crossings do not correspond to single
elements of the underlying image. Suppose the dot pattern consists of one plane separated in depth from
a second plane. In such a case, one might well find a zero-crossing that belongs at one end to dots on
the first plane, and at the other end to dots belonging to the second plane. Such zero-crossings will
be assigned disparities that reflect, to within the resolution of the channel, the structure of the
image. The zero-crossings lying between the two ends will, however, receive disparities that smoothly
vary from one extreme to the other. The largest channel would thus not see a plane separated in depth
from a second plane, but rather a smooth hump.

For the smaller mask this does not occur, as the zero-crossing contours tend to outline individual dots
or connected groups of dots. Thus the disparities assigned are such that the dots belong to one plane
or the other and the final disparity map is one of two separated planes.

To achieve perfect results from stereo, it is probably necessary to include in the 2½-dimensional
sketch a way of dealing competently with discontinuities. Some initial work has already been done in
this direction (Grimson, in preparation). Interestingly, when one looks at a 5% random-dot stereogram
portraying a square in front of its background, one sees vivid subjective contours at its boundary,
although the output of the matcher does not account for this.

(6) One consequence of the Marr-Poggio theory is that explicit disparity values will be obtained only
along the zero-crossing contours. It may be desirable to create a more complete reconstruction of the
shapes of the objects in the scene, by filling in disparity values between the zero-crossing contours.
Some work has been done in this direction (Grimson, in preparation) and an example is shown in Figure
14.

(7) An integral part of most computational theories, proposed as models of aspects of the human visual
system, is the use of computational constraints based on assumptions about the physical world (Marr and
Poggio, 1979, Marr and Hildreth, 1980, Ullman, 1979). The constraints so derived are critical in the
formation of the computational theory, and in the design of an algorithm for solving the problem. An
interesting question to raise is whether the algorithm explicitly checks that the constraints imposed
by the theory are satisfied. For example, Ullman's rigidity constraint in the analysis of structure
from motion is explicitly checked by his algorithm. For the case of the Marr-Poggio stereo theory, two
constraints were outlined, uniqueness and continuity of disparity values. It is curious that in the
algorithm used to solve the stereo problem, the continuity constraint is explicitly checked while the
uniqueness constraint is not. Uniqueness of disparity is required in one direction of matching, since
only those zero-crossing segments of one image which have exactly one match in the second image are
accepted. However, it may be the case that more than one element of the right image could be matched to
an element of the left image, for matching in this direction. When matching from the right image to the
left, the same is true. Note that one could easily alter the algorithm to include the checking of
uniqueness, thereby retaining only those disparity values corresponding to zero-crossing segments with
a unique disparity value when matched from both images. However, the evidence of Braddick discussed
above would indicate that this is not the case. Hence, in the Marr-Poggio stereo theory, although the
requirements of both uniqueness and continuity are subsumed, only one of these two constraints is
explicitly checked by the algorithm.

Figure 14. Example of filling in the disparity map. The top left figure is the initial image. The top
right figure shows the disparity map associated with the image, where the disparity is represented by
the intensity of the point. The bottom figures show the filled in map, again using intensity to
represent disparity. In the left figure, the full range of disparity is shown, indicating the slant of
the background plane, and the extreme difference in disparity between the jar and the background. In
the right figure, the intensities have been adjusted to enhance the disparities of the jar, indicating
the general shape of the interpolated surface.

(8) It is worth observing the distinction between the performance of the implementation on random
dot patterns and the performance of the implementation on natural images. Some examples are shown in
Figure 15. The main point is that on the whole, the performance is quite acceptable for random dot
patterns. However, the implementation can occasionally fail in the case of natural images. The question
is whether this reflects a basic inadequacy in the theory and its implementation, or whether there are
other aspects of the visual process interacting with stereo which have not been included in this
implementation.

This can be approached in two ways: (1) Is the assumption of modularity incorrect? In other words, is
there something wrong with the matching module as developed by Marr and Poggio, and as implemented
here? (2) Are there other modules, not considered here, which may affect the input or the output of the
matcher?

Figure 15. Examples of natural images. The top stereo pair is a scene of a basketball game. The disparity map
below is viewed from the side, so that the width of the black bars indicates the relative disparity. The bottom
stereo pair is of a sculpture by Henry Moore. The disparity maps below it are also viewed from the side. The
left map illustrates the extreme range of disparity between the trees in the background and the sculpture itself.
The right map has been adjusted to enhance the disparities of the sculpture, indicating its form.

The results of testing the implementation on the broad range of images, indicated in previous sections,
indicate that the matching module is acceptable as an independent one. In particular, the agreement
between the performance of the algorithm and that of human observers on the many random dot patterns
supports this conclusion, since in these cases all other visual cues have been isolated from the matcher.

When we turn to natural images, it is reasonable to expect that other visual modules may affect the input
to the matcher and that they may alter the output of the matcher. This is not to suggest that the matcher is
incorrect, only that the effects of other modules must be taken into account in order to explain the complete
human perception. For example, the evidence of Kidd, Frisby and Mayhew (1979) concerning the ability of
texture boundaries to drive eye vergence movements indicates that other visual information besides disparity
may alter the position of the eyes, and thus the input to the matcher. However, it does not necessarily imply
that the matcher itself needs to be modified.

Interestingly, the performance of the implementation supports this point. The implementation, which is
considered a distinct module, performs very well on random dot patterns, where there is no possibility
of interaction with other visual processes. For many natural images, this is still true. However, occasionally
a natural image provides some difficulty for the implementation. A particular example of
this occurs in the image of Figure 16. Here, the regular pattern of the windows provides a strong false targets
problem. In running the implementation, the following behavior was observed. If the optical axes were aligned
at the level of the building, the zero-crossings corresponding to the windows were all assigned a correct
disparity. If, however, the optical axes were aligned at the level of the trees in front of the building, the windows
were assigned an incorrect disparity, due to the regular pattern of zero-crossings associated with them. Clearly,
this seems wrong. Yet is the implementation wrong? Curiously, if one fuses the zero-crossing descriptions of
the convolved images without eye movements, human observers have the same problem: if the eyes are fixated
at the level of the building, the windows are correctly matched; if the eyes are fixated at the level of the trees,
the windows are incorrectly matched. I would argue that this implies that the implementation, and hence the
theory of the matching process, is in fact correct. Given a particular set of zero-crossings, the module finds
any acceptable matching and writes it into the 2½-D sketch. However, it is probably the case that some later
processing module, which examines the contents of the 2½-D sketch, is capable of altering the contents stored
there, based on more global information than is available to the matching component of the stereo process.

Figure 16. The false targets problem. The top figures are a stereo pair of a group of buildings. The bottom
figures show the zero-crossing descriptions of these images. The regular pattern of the windows of the rear
building causes difficulties for the matcher. If the alignment of the eyes corresponds to fixating at the level
of the building, the algorithm matches the zero-crossings corresponding to the windows correctly. If the
alignment of the eyes corresponds to fixating at the level of the trees in front of the building, the algorithm
matches the zero-crossings corresponding to the windows incorrectly. Experiments indicate that under similar
conditions humans have a similar perception.

Thus, I would suggest that future refinements to the Marr-Poggio theory must account for the interactions
of other aspects of visual information processing on the input and output of the matching module. Some
initial work has already been done in this direction (Grimson, in preparation).



6. Acknowledgements 

Without David Marr and Tomaso Poggio, this work would have been impossible. Ellen Hildreth, Keith 
Nishihara and Shimon Ullman provided many useful comments and suggestions. 



7. References 

Braddick, O. 1978 Multiple matching in stereopsis. (unpublished MIT report).

Campbell, F.W. and Robson, J. 1968 Application of Fourier analysis to the visibility of gratings. J.
Physiol., Lond. 197, 551-566.

Grimson, W.E.L. A refinement of a computational theory of human stereo vision (in preparation).

Grimson, W.E.L. and Marr, D. 1979 A computer implementation of a theory of human stereo vision.
Proceedings: Image Understanding Workshop, 41-47.

Julesz, B. 1960 Binocular depth perception of computer-generated patterns. Bell System Tech. J. 39,
1125-1162.

Julesz, B. 1971 Foundations of cyclopean perception. Chicago: The University of Chicago Press.

Julesz, B. and Miller, J.E. 1975 Independent spatial-frequency-tuned channels in binocular fusion and
rivalry. Perception 4, 125-143.

Kidd, A.L., Frisby, J.P. and Mayhew, J.E.W. 1979 Texture contours can facilitate stereopsis by initiating
appropriate vergence eye movements. Nature 280, 829-832.

Knight, T.F., Moon, D.A., Holloway, J., and Steele, G.L. 1979 CADR. MIT Artificial Intelligence
Laboratory Memo 528.

Marr, D. and Hildreth, E. 1980 Theory of edge detection. Proc. R. Soc. Lond. (in the press).

Marr, D. and Nishihara, H.K. 1978 Representation and recognition of the spatial organization of
three-dimensional shapes. Proc. R. Soc. Lond. B. 200, 269-294.

Marr, D. and Poggio, T. 1976 Cooperative computation of stereo disparity. Science, N.Y. 194, 283-287.

Marr, D. and Poggio, T. 1979 A computational theory of human stereo vision. Proc. R. Soc. Lond. B. 204,
301-328.

Marr, D., Poggio, T. and Hildreth, E. 1979 The smallest channel in early human vision. JOSA (submitted
for publication).

Mayhew, J.E.W. and Frisby, J.P. 1976 Rivalrous texture stereograms. Nature, Lond. 264, 53-56.

O'Brien, B. 1951 Vision and resolution in the central retina. J. Opt. Soc. Am. 41, 882-894.

Ullman, S. 1979 The interpretation of visual motion. Cambridge: MIT Press.

Wilson, H.R. and Bergen, J.R. 1979 A four mechanism model for spatial vision. Vision Res. (in the press).

Wilson, H.R. and Giese, S.C. 1977 Threshold visibility of frequency gradient patterns. Vision Res. 17,
1177-1190.






To distinguish between the case of two images beyond the range of fusion (for the current eye positions), which
will have only randomly matching zero-crossings, and the case of two images within the range of fusion, the
Marr-Poggio theory requires that the percentage of unmatched points be less than some threshold. This threshold
is approximately 0.3, according to the statistical analysis of Marr and Poggio (1979). For the case of the
pattern with 30% decorrelation, on average each region of the image will have roughly 30% of its
zero-crossings different, and hence the algorithm decides that the region is out of range of correspondence.
Hence, no disparities are accepted for this region.
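The in-range test just described can be sketched as follows. This is only an illustration under stated assumptions: the function name `region_in_range`, the representation of a region as a list of per-zero-crossing flags, and the strict comparison are hypothetical; only the threshold of roughly 0.3 comes from the text.

```python
def region_in_range(matched_flags, threshold=0.3):
    """Return True if the fraction of unmatched zero-crossings is below threshold.

    matched_flags holds one boolean per zero-crossing in the region
    (True means a match was found). This representation is an assumption.
    """
    if not matched_flags:
        return False  # no zero-crossings, so there is nothing to accept
    unmatched = sum(1 for m in matched_flags if not m)
    return unmatched / len(matched_flags) < threshold


# 10% unmatched: treated as within range of fusion.
fused = region_in_range([True] * 9 + [False])
# 30% unmatched: treated as out of range, so its disparities are discarded.
decorrelated = region_in_range([True] * 7 + [False] * 3)
```

With these inputs a region that is 30% decorrelated sits at the threshold and is rejected, matching the behavior reported for the 30% pattern.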

For the algorithm, the computational reason for the failure to process patterns with 30% decorrelation
is that it could not distinguish a correctly matched region of such a pattern from a region which was out of
range of correspondence, but had a random set of matches for many of the points in the region. It is interesting
to note that many human subjects observe a similar behavior; that is, some kind of fusion for up to 20%
decorrelation, although the fusion becomes increasingly weaker, and virtually no fusion for patterns with 30%
decorrelation.

One can also decorrelate the pattern by breaking up all white triplets along one set of diagonals, and 
all black triplets along the other set of diagonals (Julesz 1971, p.87). The table entry Uncorrd indicates the 
matching statistics for this case. Again, it can be seen that the program still obtains a good match, as do human 
observers. The performance of the algorithm is illustrated in Figure 13. 



4. Statistics 



A number of parameters are important for the theory, which makes assumptions about them, and they
have been measured on random dot images. The worst cases occur for patterns with a density of 50%, and
for such patterns the worst case values encountered for the parameters have the values shown in Table 2. The
theoretical worst case bounds used by Marr and Poggio appear for comparison.

TABLE OF STATISTICS

parameter                          expected worst   large channel   medium channel   small channel
                                   case behavior    w = 35          w = 17

average distance between           2w               1.51w           1.88w            1.87w
zero-crossings of same sign

probability of candidates in       > .50            .77             .75              .69
at most one pool

probability of candidates in       < .45            .21             .25              .31
two pools

probability of candidates in       < .05            .02             .01              (illegible)
all three pools

given a candidate near zero,       > .9             .33             .85              .8?
probability of no other
candidates

Table 2.

Figure 13. The top stereo pair is a 50% density pattern in which the left image has been diagonally
decorrelated. Along one set of diagonals, every triplet of white dots has been broken by the insertion of a black dot,
and along the other set of diagonals, every triplet of black dots has been broken by the insertion of a white dot.
The disparity map is shown below. The bottom stereo pair is a special case of Panum's limit. The left image is
formed by superimposing two slightly displaced copies of the right image. The disparity map is shown below,
and consists of two superimposed planes.



5. Comments and Discussion

Implementing a computational theory offers us the opportunity of testing its adequacy. In this case, I
have found that the performance of the implementation coincides well with that of human subjects over a
broad range of random dot test cases obtained from the literature, including defocussing of, compression of,
and the introduction of various kinds of masking noise to one image of a random dot stereo pair.

The process of implementing the theory also led to the following observations and refinements of the
theory.

(1) There are a number of questions concerning the form of the 2½-D sketch. The first critical question
concerns whether the sketch reflects the initial or the retinal images. In the first case, the coordinates of the
sketch would be directly related to the coordinates of the images of the entire scene. However, since disparity
information about the scene is extracted from several eye positions, in order to store this information into a
buffer with a coordinate system connected to the image of the scene, explicit information about the positions of
the eyes is required. For the computer implementation, this is possible, but for a model of the human visual
system, it seems unlikely that such information is available to the stereo process. In the second case, no such
problem arises. Here, the coordinates of the sketch are directly related to the coordinates of the retinal images.
Such a system would be retinocentric, reflecting the current positions of the eyes. This seems to be the most
natural representation.

The second question concerns the use of a fovea. Different sections of the images are analyzed at different
resolutions, for a given position of the optical axes. An important consequence of this is that the amount of
buffer space required to store the disparity will vary widely in the visual field, being much greater for the fovea
than for the periphery. This also suggests the use of a retinocentric representation, because if one used a frame
that had already allowed for eye-movements, it would have to have foveal resolution everywhere. Not only
does such a buffer waste space, but it does not agree with our own experience as perceivers. If such a buffer
were used, we should be able to build up a perceptual impression of the world that was everywhere as detailed
as it is at the centre of the gaze, and this is clearly not the case.

The final point about the 2½-D sketch is that it is intended as an intermediate representation of the
current scene. It is important for such a representation to pass on its information to higher level processes as
quickly as possible. Thus, it probably cannot wait for a representation to be built up over several positions
of the eyes. Rather, it must be refreshed for each eye position. Thus, a refinement to the implementation, as
outlined above, would be to use a representation that is retinocentric, and which represents disparities with
decreasing resolution as eccentricity increases.

For the cases illustrated in this article, the 2½-D sketch was created by storing fine resolution disparity
values into a scene-centered representation. A second alternative is to store values from all channels into a
retinocentric representation, using disparity values from the smaller channels where available, and the coarser
disparities from the larger channels elsewhere. In this way, a disparity representation for a single fixation of the
eyes may be constructed, with disparity resolution varying across the retina. Such a method of creating the
2½-D sketch has been tested on the implementation, with good results.
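The second alternative can be sketched in a few lines. This is a minimal illustration rather than the code of the implementation: it assumes three channels whose disparity maps share one grid, with `None` marking positions where a channel produced no match, and the name `merge_channels` is hypothetical.

```python
def merge_channels(small, medium, large):
    """Combine per-channel disparity maps, preferring the finest available value.

    Each argument is a list of rows; an entry is a disparity, or None where
    that channel produced no match. All maps are assumed to share one grid.
    """
    merged = []
    for row_s, row_m, row_l in zip(small, medium, large):
        merged_row = []
        for d_s, d_m, d_l in zip(row_s, row_m, row_l):
            # Take the first channel, finest first, that has a value.
            value = next((d for d in (d_s, d_m, d_l) if d is not None), None)
            merged_row.append(value)
        merged.append(merged_row)
    return merged


small = [[2, None, None]]
medium = [[3, 4, None]]
large = [[3, 5, 6]]
sketch = merge_channels(small, medium, large)
```

Here the first position takes the small-channel value, the second falls back to the medium channel, and the third to the large channel, so resolution degrades only where finer values are absent.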

(2) The neighbourhood over which a search for a matching zero-crossing is conducted is broken into
three pools. In the present implementation, the pools are used to deal with the ambiguous case of two
matching zero-crossings, while the disparity values associated with a match are represented to within an image
element. A second possibility is to use the pools not only to disambiguate multiple matches, but also to assign
a disparity to a match. Thus, a single disparity value, equal to the disparity value of the midpoint of the
pool, would be assigned for a matching zero-crossing lying anywhere within the pool. In this scheme, only
three possible disparities could be assigned to a zero-crossing: zero, corresponding to the middle pool, or the
midpoint disparity of the divergent or convergent pool.

Computer experiments show that either scheme will work. In the case of a single disparity value for each
pool, the disparities assigned by the smallest channel are within an image element of those obtained using
exact disparities for each match. This modification was tried on both natural images and random dot patterns,
and suggests that the accuracy with which the pools represent the match is not a critical factor.
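The single-value-per-pool scheme can be illustrated as below. The pool boundaries are assumptions made only for this sketch: the search range is taken to be [-w, w], the middle pool is taken to cover disparities of magnitude up to a hypothetical `zero_halfwidth`, and the sign convention for the two outer pools is arbitrary.

```python
def pool_disparity(d, w, zero_halfwidth=1):
    """Quantise an exact disparity d (with |d| <= w) to its pool's midpoint.

    The pool layout is assumed: a middle pool of half-width zero_halfwidth
    and two outer pools covering the rest of the range out to w.
    """
    if abs(d) > w:
        raise ValueError("disparity outside the search range")
    if abs(d) <= zero_halfwidth:
        return 0.0  # middle pool
    midpoint = (zero_halfwidth + w) / 2
    return midpoint if d > 0 else -midpoint


# With w = 9 the outer pools span magnitudes 1 to 9, so their midpoint is 5.
near_zero = pool_disparity(1, w=9)
positive = pool_disparity(4, w=9)
negative = pool_disparity(-7, w=9)
```

Every match is thus reduced to one of three values, which is the coarsest representation the scheme allows for a single channel.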

(3) Although the Marr-Poggio matcher is designed to match from one image into the other, there is no
inherent reason why the matching process cannot be driven from both eyes independently. In fact, there
may be some evidence that this is so, as is shown by the following experiment of O. Braddick (1978) on an
extension to Panum's limiting case. First, a sparse random dot pattern was constructed. From this pattern, a
partner was created by displacing the entire pattern by slight amounts to both the left and the right. Thus, for
each dot in the right image, there corresponded two dots in the left image, one with a small displacement to
the left and one with a small displacement to the right. The perception obtained by viewing such a random dot
stereogram is one of two superimposed planes.

Suppose the matching process were only driven from one image, for example, matches were made from
the right image to the left. In this case, the implementation would not be able to account for the Braddick
perception, since all the zero-crossings would have two possible candidates. However, suppose that the
matching process were driven independently from both the right and left images, and an unambiguous match from
either side accepted. In this case, although every zero-crossing in the right image would have an ambiguous
match, the implementation would obtain a unique match for each zero-crossing in the left image. The
implementation was designed to account for matching from either image.
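The argument can be checked on a toy version of Braddick's pattern. The sketch below is an illustration under simplifying assumptions, not the implementation itself: dots on a single scan line stand in for zero-crossings, sign and orientation are ignored, and the names `candidates`, `match_from` and `max_disp` are hypothetical.

```python
def candidates(x, others, max_disp):
    """Positions in `others` lying within max_disp of x."""
    return [y for y in others if abs(y - x) <= max_disp]


def match_from(source, target, max_disp):
    """Keep only source points with exactly one candidate in target.

    Returns a dict mapping source position to disparity (target - source).
    """
    matches = {}
    for x in source:
        cands = candidates(x, target, max_disp)
        if len(cands) == 1:
            matches[x] = cands[0] - x
    return matches


right = [10, 30, 50]  # sparse dots on one scan line
# Each right dot appears twice in the left image, displaced both ways.
left = sorted([x - 2 for x in right] + [x + 2 for x in right])

from_right = match_from(right, left, max_disp=3)  # every dot is ambiguous
from_left = match_from(left, right, max_disp=3)   # every dot is unique
disparities = sorted(set(from_left.values()))
```

Matching from the right image accepts nothing, while matching from the left image accepts all six dots at two distinct disparities, which corresponds to the two superimposed planes.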

Braddick's case has been tested on the implementation, and the results are shown in Figure 13. It can be
seen that the result of the implementation is that of two transparent planes.

(4) The points that were incorrectly matched in the test cases all lay along depth discontinuities. The
major reason for this is connected with occlusion of regions. Note that at any depth discontinuity, there will
be an occluded region which is present in one image, but not the other. Any zero-crossings within that region
cannot, of course, have a matching zero-crossing in the other image. However, there is a certain probability
of such a zero-crossing being matched incorrectly to a random zero-crossing in the other image. In principle,
the algorithm detects regions which are occluded, by checking the statistics of the number of unmatched
zero-crossings, and using such results to mark all zero-crossing matches in the region as unknown. However, for a
region which contains a depth discontinuity, only part of the region will have the above characteristics.
Zero-crossings in the rest of the region will have a unique match. Thus, when the statistical check on the number
of unmatched points is performed, it is possible for the entire region to be considered in range, and thus all
matches, including the incorrect ones of the occluded region, will be accepted.

(5) It is interesting to comment on the effect of depth discontinuities for the different sized masks. For
random dot patterns, the zero-crossings obtained from the larger masks tend to outline blobs or clusters of
dots. Thus in general, the positions of the zero-crossings do not correspond to single elements of the underlying
image. Suppose the dot pattern consists of one plane separated in depth from a second plane. In such a
case, one might well find a zero-crossing that belongs at one end to dots on the first plane, and at the other end
to dots belonging to the second plane. Such zero-crossings will be assigned disparities that reflect, to within
the resolution of the channel, the structure of the image. The zero-crossings lying between the two ends will,
however, receive disparities that smoothly vary from one extreme to the other. The largest channel would thus
not see a plane separated in depth from a second plane, but rather a smooth hump.

For the smaller mask this does not occur, as the zero-crossing contours tend to outline individual dots or
connected groups of dots. Thus the disparities assigned are such that the dots belong to one plane or the other
and the final disparity map is one of two separated planes.

To achieve perfect results from stereo, it is probably necessary to include in the 2½-dimensional sketch
a way of dealing competently with discontinuities. Some initial work has already been done in this direction
(Grimson, in preparation). Interestingly, when one looks at a 5% random-dot stereogram portraying a square
in front of its background, one sees vivid subjective contours at its boundary, although the output of the
matcher does not account for this.

(6) One consequence of the Marr-Poggio theory is that explicit disparity values will be obtained only
along the zero-crossing contours. It may be desirable to create a more complete reconstruction of the shapes of
the objects in the scene, by filling in disparity values between the zero-crossing contours. Some work has been
done in this direction (Grimson, in preparation) and an example is shown in Figure 14.
